SGML exploration server

Warning: this site is under development!
Warning: this site was generated automatically from raw corpora.
The information has therefore not been validated.

Finding Frequent Structural Features among Words in Tree-Structured Documents

Internal identifier: 000998 (Main/Exploration); previous: 000997; next: 000999

Authors: Tomoyuki Uchida [Japan]; Tomonori Mogawa [Japan]; Yasuaki Nakamura [Japan]

Source :

RBID : ISTEX:40B3A82782E1E9F30CED10BDF201F393113E4EB9

Abstract

Abstract: Many electronic documents such as SGML/HTML/XML files and LaTeX files have tree structures. Such documents are called tree-structured documents. Many tree-structured documents contain large plain texts. In order to extract structural features among words from tree-structured documents, we consider the problem of finding frequent structured patterns among words in tree-structured documents. Let $k \geq 2$ be an integer and $(W_1, W_2, \ldots, W_k)$ a list of words which are sorted in lexicographical order. A consecutive path pattern on $(W_1, W_2, \ldots, W_k)$ is a sequence $\langle t_1; t_2; \ldots; t_{k-1} \rangle$ of labeled rooted ordered trees such that, for $i = 1, 2, \ldots, k-1$, (1) $t_i$ consists of only one node having the pair $(W_i, W_{i+1})$ as its label, or (2) $t_i$ has just two nodes whose degrees are one and which are labeled with $W_i$ and $W_{i+1}$, respectively. We present a data mining algorithm for finding all frequent consecutive path patterns in tree-structured documents. Then, by reporting experimental results on our algorithm, we show that our algorithm is efficient for extracting structural features from tree-structured documents.
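
The definition of a consecutive path pattern in the abstract can be illustrated with a small validity check. This is only a sketch of the definition, not the authors' mining algorithm. The `Node` class and the function name `is_consecutive_path_pattern` are hypothetical, and the code assumes that in case (2) the root carries the first word and its single child carries the second, which the abstract does not specify.

```python
from dataclasses import dataclass, field
from typing import List, Sequence, Tuple, Union

# A node label is either a single word or a pair of words (case 1).
Label = Union[str, Tuple[str, str]]


@dataclass
class Node:
    """Hypothetical labeled rooted ordered tree node."""
    label: Label
    children: List["Node"] = field(default_factory=list)


def is_consecutive_path_pattern(words: Sequence[str], trees: Sequence[Node]) -> bool:
    """Check the abstract's definition for a word list and a tree sequence."""
    k = len(words)
    # k >= 2, words sorted lexicographically, and exactly k - 1 trees.
    if k < 2 or list(words) != sorted(words) or len(trees) != k - 1:
        return False
    for i, t in enumerate(trees):
        pair = (words[i], words[i + 1])
        # Case (1): a single node labeled with the pair of adjacent words.
        single = not t.children and t.label == pair
        # Case (2): two nodes joined by one edge, labeled with the two words
        # (assumption: root holds the first word, child the second).
        two = (
            len(t.children) == 1
            and not t.children[0].children
            and (t.label, t.children[0].label) == pair
        )
        if not (single or two):
            return False
    return True


words = ["data", "mining", "tree"]
pattern = [Node(("data", "mining")), Node("mining", [Node("tree")])]
print(is_consecutive_path_pattern(words, pattern))
```

Here the first tree is an instance of case (1) and the second an instance of case (2), so the pattern has length k - 1 = 2 for k = 3 words.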

Url:
DOI: 10.1007/978-3-540-24775-3_43



The document in XML format

<record>
<TEI wicri:istexFullTextTei="biblStruct">
<teiHeader>
<fileDesc>
<titleStmt>
<title xml:lang="en">Finding Frequent Structural Features among Words in Tree-Structured Documents</title>
<author>
<name sortKey="Uchida, Tomoyuki" sort="Uchida, Tomoyuki" uniqKey="Uchida T" first="Tomoyuki" last="Uchida">Tomoyuki Uchida</name>
</author>
<author>
<name sortKey="Mogawa, Tomonori" sort="Mogawa, Tomonori" uniqKey="Mogawa T" first="Tomonori" last="Mogawa">Tomonori Mogawa</name>
</author>
<author>
<name sortKey="Nakamura, Yasuaki" sort="Nakamura, Yasuaki" uniqKey="Nakamura Y" first="Yasuaki" last="Nakamura">Yasuaki Nakamura</name>
</author>
</titleStmt>
<publicationStmt>
<idno type="wicri:source">ISTEX</idno>
<idno type="RBID">ISTEX:40B3A82782E1E9F30CED10BDF201F393113E4EB9</idno>
<date when="2004" year="2004">2004</date>
<idno type="doi">10.1007/978-3-540-24775-3_43</idno>
<idno type="url">https://api.istex.fr/ark:/67375/HCB-3GZ3WVK7-F/fulltext.pdf</idno>
<idno type="wicri:Area/Istex/Corpus">001080</idno>
<idno type="wicri:explorRef" wicri:stream="Istex" wicri:step="Corpus" wicri:corpus="ISTEX">001080</idno>
<idno type="wicri:Area/Istex/Curation">000D30</idno>
<idno type="wicri:Area/Istex/Checkpoint">000927</idno>
<idno type="wicri:explorRef" wicri:stream="Istex" wicri:step="Checkpoint">000927</idno>
<idno type="wicri:doubleKey">0302-9743:2004:Uchida T:finding:frequent:structural</idno>
<idno type="wicri:Area/Main/Merge">000A07</idno>
<idno type="wicri:source">INIST</idno>
<idno type="RBID">Pascal:04-0300385</idno>
<idno type="wicri:Area/PascalFrancis/Corpus">000033</idno>
<idno type="wicri:Area/PascalFrancis/Curation">000149</idno>
<idno type="wicri:Area/PascalFrancis/Checkpoint">000020</idno>
<idno type="wicri:explorRef" wicri:stream="PascalFrancis" wicri:step="Checkpoint">000020</idno>
<idno type="wicri:doubleKey">0302-9743:2004:Uchida T:finding:frequent:structural</idno>
<idno type="wicri:Area/Main/Merge">000B09</idno>
<idno type="wicri:Area/Main/Curation">000998</idno>
<idno type="wicri:Area/Main/Exploration">000998</idno>
</publicationStmt>
<sourceDesc>
<biblStruct>
<analytic>
<title level="a" type="main" xml:lang="en">Finding Frequent Structural Features among Words in Tree-Structured Documents</title>
<author>
<name sortKey="Uchida, Tomoyuki" sort="Uchida, Tomoyuki" uniqKey="Uchida T" first="Tomoyuki" last="Uchida">Tomoyuki Uchida</name>
<affiliation wicri:level="1">
<country xml:lang="fr">Japon</country>
<wicri:regionArea>Faculty of Information Sciences, Hiroshima City University, 731-3194, Hiroshima</wicri:regionArea>
<wicri:noRegion>Hiroshima</wicri:noRegion>
</affiliation>
<affiliation wicri:level="1">
<country wicri:rule="url">Japon</country>
</affiliation>
</author>
<author>
<name sortKey="Mogawa, Tomonori" sort="Mogawa, Tomonori" uniqKey="Mogawa T" first="Tomonori" last="Mogawa">Tomonori Mogawa</name>
<affiliation wicri:level="1">
<country xml:lang="fr">Japon</country>
<wicri:regionArea>Department of Computer and Media Technologies, Hiroshima City University, 731-3194, Hiroshima</wicri:regionArea>
<wicri:noRegion>Hiroshima</wicri:noRegion>
</affiliation>
<affiliation wicri:level="1">
<country wicri:rule="url">Japon</country>
</affiliation>
</author>
<author>
<name sortKey="Nakamura, Yasuaki" sort="Nakamura, Yasuaki" uniqKey="Nakamura Y" first="Yasuaki" last="Nakamura">Yasuaki Nakamura</name>
<affiliation wicri:level="1">
<country xml:lang="fr">Japon</country>
<wicri:regionArea>Faculty of Information Sciences, Hiroshima City University, 731-3194, Hiroshima</wicri:regionArea>
<wicri:noRegion>Hiroshima</wicri:noRegion>
</affiliation>
<affiliation wicri:level="1">
<country wicri:rule="url">Japon</country>
</affiliation>
</author>
</analytic>
<monogr></monogr>
<series>
<title level="s" type="main" xml:lang="en">Lecture Notes in Computer Science</title>
<idno type="ISSN">0302-9743</idno>
<idno type="eISSN">1611-3349</idno>
<idno type="ISSN">0302-9743</idno>
</series>
</biblStruct>
</sourceDesc>
<seriesStmt>
<idno type="ISSN">0302-9743</idno>
</seriesStmt>
</fileDesc>
<profileDesc>
<textClass>
<keywords scheme="KwdEn" xml:lang="en">
<term>Data analysis</term>
<term>Data mining</term>
<term>Electronic document</term>
<term>File structure</term>
<term>HTML language</term>
<term>Information extraction</term>
<term>Latex</term>
<term>SGML language</term>
<term>Text</term>
<term>Tree structured method</term>
<term>Word</term>
<term>XML language</term>
</keywords>
<keywords scheme="Pascal" xml:lang="fr">
<term>Analyse donnée</term>
<term>Document électronique</term>
<term>Extraction information</term>
<term>Fouille donnée</term>
<term>Langage HTML</term>
<term>Langage SGML</term>
<term>Langage XML</term>
<term>Latex</term>
<term>Mot</term>
<term>Méthode arborescente</term>
<term>Structure fichier</term>
<term>Texte</term>
</keywords>
<keywords scheme="Wicri" type="topic" xml:lang="fr">
<term>Document électronique</term>
</keywords>
</textClass>
</profileDesc>
</teiHeader>
<front>
<div type="abstract" xml:lang="en">Abstract: Many electronic documents such as SGML/HTML/XML files and LaTeX files have tree structures. Such documents are called tree-structured documents. Many tree-structured documents contain large plain texts. In order to extract structural features among words from tree-structured documents, we consider the problem of finding frequent structured patterns among words in tree-structured documents. Let $k \geq 2$ be an integer and $(W_1, W_2, \ldots, W_k)$ a list of words which are sorted in lexicographical order. A consecutive path pattern on $(W_1, W_2, \ldots, W_k)$ is a sequence $\langle t_1; t_2; \ldots; t_{k-1} \rangle$ of labeled rooted ordered trees such that, for $i = 1, 2, \ldots, k-1$, (1) $t_i$ consists of only one node having the pair $(W_i, W_{i+1})$ as its label, or (2) $t_i$ has just two nodes whose degrees are one and which are labeled with $W_i$ and $W_{i+1}$, respectively. We present a data mining algorithm for finding all frequent consecutive path patterns in tree-structured documents. Then, by reporting experimental results on our algorithm, we show that our algorithm is efficient for extracting structural features from tree-structured documents.</div>
</front>
</TEI>
<affiliations>
<list>
<country>
<li>Japon</li>
</country>
</list>
<tree>
<country name="Japon">
<noRegion>
<name sortKey="Uchida, Tomoyuki" sort="Uchida, Tomoyuki" uniqKey="Uchida T" first="Tomoyuki" last="Uchida">Tomoyuki Uchida</name>
</noRegion>
<name sortKey="Mogawa, Tomonori" sort="Mogawa, Tomonori" uniqKey="Mogawa T" first="Tomonori" last="Mogawa">Tomonori Mogawa</name>
<name sortKey="Mogawa, Tomonori" sort="Mogawa, Tomonori" uniqKey="Mogawa T" first="Tomonori" last="Mogawa">Tomonori Mogawa</name>
<name sortKey="Nakamura, Yasuaki" sort="Nakamura, Yasuaki" uniqKey="Nakamura Y" first="Yasuaki" last="Nakamura">Yasuaki Nakamura</name>
<name sortKey="Nakamura, Yasuaki" sort="Nakamura, Yasuaki" uniqKey="Nakamura Y" first="Yasuaki" last="Nakamura">Yasuaki Nakamura</name>
<name sortKey="Uchida, Tomoyuki" sort="Uchida, Tomoyuki" uniqKey="Uchida T" first="Tomoyuki" last="Uchida">Tomoyuki Uchida</name>
</country>
</tree>
</affiliations>
</record>
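
As an illustration of how the record above might be read programmatically, here is a minimal sketch using Python's standard `xml.etree.ElementTree`. The `SAMPLE` string is a trimmed excerpt of the record, and the function name `summarize` is hypothetical. The excerpt deliberately leaves out the `wicri:`-prefixed elements and attributes: the record as shown does not declare that prefix, so a namespace-aware parser would reject it unless a declaration is added (an assumption about how the full record would need to be handled).

```python
import xml.etree.ElementTree as ET

# Trimmed excerpt of the record; wicri:-prefixed parts are omitted on purpose.
SAMPLE = """<record><TEI><teiHeader><fileDesc>
<titleStmt>
<title xml:lang="en">Finding Frequent Structural Features among Words in Tree-Structured Documents</title>
<author><name first="Tomoyuki" last="Uchida">Tomoyuki Uchida</name></author>
<author><name first="Tomonori" last="Mogawa">Tomonori Mogawa</name></author>
<author><name first="Yasuaki" last="Nakamura">Yasuaki Nakamura</name></author>
</titleStmt>
<publicationStmt>
<idno type="RBID">ISTEX:40B3A82782E1E9F30CED10BDF201F393113E4EB9</idno>
<date when="2004" year="2004">2004</date>
<idno type="doi">10.1007/978-3-540-24775-3_43</idno>
</publicationStmt>
</fileDesc></teiHeader></TEI></record>"""


def summarize(xml_text: str) -> dict:
    """Extract title, authors, year and DOI from a record excerpt."""
    root = ET.fromstring(xml_text)
    stmt = root.find(".//titleStmt")
    return {
        "title": stmt.findtext("title"),
        "authors": [name.text for name in stmt.iterfind("author/name")],
        "year": root.findtext(".//publicationStmt/date"),
        # ElementTree's limited XPath supports attribute-value predicates.
        "doi": root.findtext(".//idno[@type='doi']"),
    }


info = summarize(SAMPLE)
print(info["authors"], info["year"], info["doi"])
```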

To manipulate this document under Unix (Dilib)

EXPLOR_STEP=$WICRI_ROOT/Wicri/Informatique/explor/SgmlV1/Data/Main/Exploration
HfdSelect -h $EXPLOR_STEP/biblio.hfd -nk 000998 | SxmlIndent | more

Or

HfdSelect -h $EXPLOR_AREA/Data/Main/Exploration/biblio.hfd -nk 000998 | SxmlIndent | more

To link to this page within the Wicri network

{{Explor lien
   |wiki=    Wicri/Informatique
   |area=    SgmlV1
   |flux=    Main
   |étape=   Exploration
   |type=    RBID
   |clé=     ISTEX:40B3A82782E1E9F30CED10BDF201F393113E4EB9
   |texte=   Finding Frequent Structural Features among Words in Tree-Structured Documents
}}
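
If many such links have to be produced, the template call above could be generated rather than typed by hand. The following is a minimal sketch; the function name `explor_link` is hypothetical, and the column alignment (parameter names padded to ten characters after a three-space indent) is simply inferred from the template as displayed on this page.

```python
def explor_link(params: dict) -> str:
    """Render an {{Explor lien}} template call from ordered parameters."""
    lines = ["{{Explor lien"]
    for key, value in params.items():
        # Pad "|key=" to 10 characters so that the values line up.
        lines.append("   " + f"|{key}=".ljust(10) + value)
    lines.append("}}")
    return "\n".join(lines)


link = explor_link({
    "wiki": "Wicri/Informatique",
    "area": "SgmlV1",
    "flux": "Main",
    "étape": "Exploration",
    "type": "RBID",
    "clé": "ISTEX:40B3A82782E1E9F30CED10BDF201F393113E4EB9",
    "texte": "Finding Frequent Structural Features among Words in Tree-Structured Documents",
})
print(link)
```

The parameter names (`wiki`, `area`, `flux`, `étape`, `type`, `clé`, `texte`) are kept exactly as they appear in the template, since the wiki depends on them.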


This area was generated with Dilib version V0.6.33.
Data generation: Mon Jul 1 14:26:08 2019. Site generation: Wed Apr 28 21:40:44 2021